In this analysis we use the Prosper loan data to find out which factors affect the borrower's APR (Annual Percentage Rate) and to build a prediction model for it. We also want to show borrowers how they can reduce their BorrowerAPR.
Show the basic meaning of each variable.
## Variable
## 1 ListingKey
## 2 ListingNumber
## 3 ListingCreationDate
## 4 CreditGrade
## 5 Term
## 6 LoanStatus
## 7 ClosedDate
## 8 BorrowerAPR
## 9 BorrowerRate
## 10 LenderYield
## 11 EstimatedEffectiveYield
## 12 EstimatedLoss
## 13 EstimatedReturn
## 14 ProsperRating (numeric)
## 15 ProsperRating (Alpha)
## 16 ProsperScore
## 17 ListingCategory
## 18 BorrowerState
## 19 Occupation
## 20 EmploymentStatus
## 21 EmploymentStatusDuration
## 22 IsBorrowerHomeowner
## 23 CurrentlyInGroup
## 24 GroupKey
## 25 DateCreditPulled
## 26 CreditScoreRangeLower
## 27 CreditScoreRangeUpper
## 28 FirstRecordedCreditLine
## 29 CurrentCreditLines
## 30 OpenCreditLines
## 31 TotalCreditLinespast7years
## 32 OpenRevolvingAccounts
## 33 OpenRevolvingMonthlyPayment
## 34 InquiriesLast6Months
## 35 TotalInquiries
## 36 CurrentDelinquencies
## 37 AmountDelinquent
## 38 DelinquenciesLast7Years
## 39 PublicRecordsLast10Years
## 40 PublicRecordsLast12Months
## 41 RevolvingCreditBalance
## 42 BankcardUtilization
## 43 AvailableBankcardCredit
## 44 TotalTrades
## 45 TradesNeverDelinquent
## 46 TradesOpenedLast6Months
## 47 DebtToIncomeRatio
## 48 IncomeRange
## 49 IncomeVerifiable
## 50 StatedMonthlyIncome
## 51 LoanKey
## 52 TotalProsperLoans
## 53 TotalProsperPaymentsBilled
## 54 OnTimeProsperPayments
## 55 ProsperPaymentsLessThanOneMonthLate
## 56 ProsperPaymentsOneMonthPlusLate
## 57 ProsperPrincipalBorrowed
## 58 ProsperPrincipalOutstanding
## 59 ScorexChangeAtTimeOfListing
## 60 LoanCurrentDaysDelinquent
## 61 LoanFirstDefaultedCycleNumber
## 62 LoanMonthsSinceOrigination
## 63 LoanNumber
## 64 LoanOriginalAmount
## 65 LoanOriginationDate
## 66 LoanOriginationQuarter
## 67 MemberKey
## 68 MonthlyLoanPayment
## 69 LP_CustomerPayments
## 70 LP_CustomerPrincipalPayments
## 71 LP_InterestandFees
## 72 LP_ServiceFees
## 73 LP_CollectionFees
## 74 LP_GrossPrincipalLoss
## 75 LP_NetPrincipalLoss
## 76 LP_NonPrincipalRecoverypayments
## 77 PercentFunded
## 78 Recommendations
## 79 InvestmentFromFriendsCount
## 80 InvestmentFromFriendsAmount
## 81 Investors
## Description
## 1 Unique key for each listing, same value as the 'key' used in the listing object in the API.
## 2 The number that uniquely identifies the listing to the public as displayed on the website.
## 3 The date the listing was created.
## 4 The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.
## 5 The length of the loan expressed in months.
## 6 The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
## 7 Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses.
## 8 The Borrower's Annual Percentage Rate (APR) for the loan.
## 9 The Borrower's interest rate for this loan.
## 10 The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee.
## 11 Effective yield is equal to the borrower interest rate (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. Applicable for loans originated after July 2009.
## 12 Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009.
## 13 The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009.
## 14 The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.
## 15 The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.
## 16 A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.
## 17 The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
## 18 The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
## 19 The Occupation selected by the Borrower at the time they created the listing.
## 20 The employment status of the borrower at the time they posted the listing.
## 21 The length in months of the employment status at the time the listing was created.
## 22 A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
## 23 Specifies whether or not the Borrower was in a group at the time the listing was created.
## 24 The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation.
## 25 The date the credit profile was pulled.
## 26 The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
## 27 The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
## 28 The date the first credit line was opened.
## 29 Number of current credit lines at the time the credit profile was pulled.
## 30 Number of open credit lines at the time the credit profile was pulled.
## 31 Number of credit lines in the past seven years at the time the credit profile was pulled.
## 32 Number of open revolving accounts at the time the credit profile was pulled.
## 33 Monthly payment on revolving accounts at the time the credit profile was pulled.
## 34 Number of inquiries in the past six months at the time the credit profile was pulled.
## 35 Total number of inquiries at the time the credit profile was pulled.
## 36 Number of accounts delinquent at the time the credit profile was pulled.
## 37 Dollars delinquent at the time the credit profile was pulled.
## 38 Number of delinquencies in the past 7 years at the time the credit profile was pulled.
## 39 Number of public records in the past 10 years at the time the credit profile was pulled.
## 40 Number of public records in the past 12 months at the time the credit profile was pulled.
## 41 Dollars of revolving credit at the time the credit profile was pulled.
## 42 The percentage of available revolving credit that is utilized at the time the credit profile was pulled.
## 43 The total available credit via bank card at the time the credit profile was pulled.
## 44 Number of trade lines ever opened at the time the credit profile was pulled.
## 45 Number of trades that have never been delinquent at the time the credit profile was pulled.
## 46 Number of trades opened in the last 6 months at the time the credit profile was pulled.
## 47 The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
## 48 The income range of the borrower at the time the listing was created.
## 49 The borrower indicated they have the required documentation to support their income.
## 50 The monthly income the borrower stated at the time the listing was created.
## 51 Unique key for each loan. This is the same key that is used in the API.
## 52 Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans.
## 53 Number of on time payments the borrower made on Prosper loans at the time they created this listing. This value will be null if the borrower had no prior loans.
## 54 Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans.
## 55 Number of payments the borrower made on Prosper loans that were less than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
## 56 Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
## 57 Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
## 58 Principal outstanding on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
## 59 Borrower's credit score change at the time the credit profile was pulled. This will be the change relative to the borrower's last Prosper loan. This value will be null if the borrower had no prior loans.
## 60 The number of days delinquent.
## 61 The cycle the loan was charged off. If the loan has not charged off the value will be null.
## 62 Number of months since the loan originated.
## 63 Unique numeric value associated with the loan.
## 64 The origination amount of the loan.
## 65 The date the loan was originated.
## 66 The quarter in which the loan was originated.
## 67 The unique key that is associated with the borrower. This is the same identifier that is used in the API member object.
## 68 The scheduled monthly loan payment.
## 69 Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
## 70 Pre charge-off cumulative principal payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
## 71 Pre charge-off cumulative interest and fees paid by the borrower. If the loan has charged off, this value will exclude any recoveries.
## 72 Cumulative service fees paid by the investors who have invested in the loan.
## 73 Cumulative collection fees paid by the investors who have invested in the loan.
## 74 The gross charged off amount of the loan.
## 75 The principal that remains uncollected after any recoveries.
## 76 The interest and fee component of any recovery payments. The current payment policy applies payments in the following order: Fees, interest, principal.
## 77 Percent the listing was funded.
## 78 Number of recommendations the borrower had at the time the listing was created.
## 79 Number of friends that made an investment in the loan.
## 80 Dollar amount of investments that were made by friends.
## 81 The number of investors that funded the loan.
Show the structure of the data set.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
There are 81 variables and 113937 observations, so this is not a small data set. Since I am not familiar with loan data features, I need to understand them through the exploration below and some web searching.
Show the summary of the data.
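With this many columns, a quick tally of column types can help decide where to start. The sketch below is only an illustration: it uses a tiny made-up data frame (`loans`, with invented values) rather than the real file, and it assumes the date columns were read in as factors, as the structure output suggests.

```r
# Toy stand-in for the Prosper data (hypothetical values, not the real file).
loans <- data.frame(
  ListingKey = c("A1", "B2", "C3"),
  ListingCreationDate = c("2005-11-09 20:44:28",
                          "2013-10-02 17:20:16",
                          "2013-08-28 20:31:41"),
  Term = c(36L, 36L, 60L),
  BorrowerAPR = c(0.165, 0.120, 0.283),
  stringsAsFactors = TRUE  # mimic the factor columns seen in the str() output
)

# Count how many columns there are of each class.
type_counts <- table(vapply(loans, function(col) class(col)[1], character(1)))
print(type_counts)

# Date columns stored as factors can be converted before any time-based analysis.
# Only the calendar date (first 10 characters) is kept here.
loans$ListingCreationDate <- as.Date(
  substr(as.character(loans$ListingCreationDate), 1, 10)
)
str(loans)
```

On the real data the same two steps would apply to every date-like factor column, not just `ListingCreationDate`.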
## ListingKey ListingNumber
## 17A93590655669644DB4C06: 6 Min. : 4
## 349D3587495831350F0F648: 4 1st Qu.: 400919
## 47C1359638497431975670B: 4 Median : 600554
## 8474358854651984137201C: 4 Mean : 627886
## DE8535960513435199406CE: 4 3rd Qu.: 892634
## 04C13599434217079754AEE: 3 Max. :1255725
## (Other) :113912
## ListingCreationDate CreditGrade Term
## 2013-10-02 17:20:16.550000000: 6 :84984 Min. :12.00
## 2013-08-28 20:31:41.107000000: 4 C : 5649 1st Qu.:36.00
## 2013-09-08 09:27:44.853000000: 4 D : 5153 Median :36.00
## 2013-12-06 05:43:13.830000000: 4 B : 4389 Mean :40.83
## 2013-12-06 11:44:58.283000000: 4 AA : 3509 3rd Qu.:36.00
## 2013-08-21 07:25:22.360000000: 3 HR : 3508 Max. :60.00
## (Other) :113912 (Other): 6745
## LoanStatus ClosedDate
## Current :56576 :58848
## Completed :38074 2014-03-04 00:00:00: 105
## Chargedoff :11992 2014-02-19 00:00:00: 100
## Defaulted : 5018 2014-02-11 00:00:00: 92
## Past Due (1-15 days) : 806 2012-10-30 00:00:00: 81
## Past Due (31-60 days): 363 2013-02-26 00:00:00: 78
## (Other) : 1108 (Other) :54633
## BorrowerAPR BorrowerRate LenderYield
## Min. :0.00653 Min. :0.0000 Min. :-0.0100
## 1st Qu.:0.15629 1st Qu.:0.1340 1st Qu.: 0.1242
## Median :0.20976 Median :0.1840 Median : 0.1730
## Mean :0.21883 Mean :0.1928 Mean : 0.1827
## 3rd Qu.:0.28381 3rd Qu.:0.2500 3rd Qu.: 0.2400
## Max. :0.51229 Max. :0.4975 Max. : 0.4925
## NA's :25
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :-0.183 Min. :0.005 Min. :-0.183
## 1st Qu.: 0.116 1st Qu.:0.042 1st Qu.: 0.074
## Median : 0.162 Median :0.072 Median : 0.092
## Mean : 0.169 Mean :0.080 Mean : 0.096
## 3rd Qu.: 0.224 3rd Qu.:0.112 3rd Qu.: 0.117
## Max. : 0.320 Max. :0.366 Max. : 0.284
## NA's :29084 NA's :29084 NA's :29084
## ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## Min. :1.000 :29084 Min. : 1.00
## 1st Qu.:3.000 C :18345 1st Qu.: 4.00
## Median :4.000 B :15581 Median : 6.00
## Mean :4.072 A :14551 Mean : 5.95
## 3rd Qu.:5.000 D :14274 3rd Qu.: 8.00
## Max. :7.000 E : 9795 Max. :11.00
## NA's :29084 (Other):12307 NA's :29084
## ListingCategory..numeric. BorrowerState
## Min. : 0.000 CA :14717
## 1st Qu.: 1.000 TX : 6842
## Median : 1.000 NY : 6729
## Mean : 2.774 FL : 6720
## 3rd Qu.: 3.000 IL : 5921
## Max. :20.000 : 5515
## (Other):67493
## Occupation EmploymentStatus
## Other :28617 Employed :67322
## Professional :13628 Full-time :26355
## Computer Programmer : 4478 Self-employed: 6134
## Executive : 4311 Not available: 5347
## Teacher : 3759 Other : 3806
## Administrative Assistant: 3688 : 2255
## (Other) :55456 (Other) : 2718
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 False:56459 False:101218
## 1st Qu.: 26.00 True :57478 True : 12719
## Median : 67.00
## Mean : 96.07
## 3rd Qu.:137.00
## Max. :755.00
## NA's :7625
## GroupKey DateCreditPulled
## :100596 2013-12-23 09:38:12: 6
## 783C3371218786870A73D20: 1140 2013-11-21 09:09:41: 4
## 3D4D3366260257624AB272D: 916 2013-12-06 05:43:16: 4
## 6A3B336601725506917317E: 698 2014-01-14 20:17:49: 4
## FEF83377364176536637E50: 611 2014-02-09 12:14:41: 4
## C9643379247860156A00EC0: 342 2013-09-27 22:04:54: 3
## (Other) : 9634 (Other) :113912
## CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.0 Min. : 19.0
## 1st Qu.:660.0 1st Qu.:679.0
## Median :680.0 Median :699.0
## Mean :685.6 Mean :704.6
## 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :880.0 Max. :899.0
## NA's :591 NA's :591
## FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
## : 697 Min. : 0.00 Min. : 0.00
## 1993-12-01 00:00:00: 185 1st Qu.: 7.00 1st Qu.: 6.00
## 1994-11-01 00:00:00: 178 Median :10.00 Median : 9.00
## 1995-11-01 00:00:00: 168 Mean :10.32 Mean : 9.26
## 1990-04-01 00:00:00: 161 3rd Qu.:13.00 3rd Qu.:12.00
## 1995-03-01 00:00:00: 159 Max. :59.00 Max. :54.00
## (Other) :112389 NA's :7604 NA's :7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 2.00 Min. : 0.00
## 1st Qu.: 17.00 1st Qu.: 4.00
## Median : 25.00 Median : 6.00
## Mean : 26.75 Mean : 6.97
## 3rd Qu.: 35.00 3rd Qu.: 9.00
## Max. :136.00 Max. :51.00
## NA's :697
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 114.0 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 271.0 Median : 1.000 Median : 4.000
## Mean : 398.3 Mean : 1.435 Mean : 5.584
## 3rd Qu.: 525.0 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :14985.0 Max. :105.000 Max. :379.000
## NA's :697 NA's :1159
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 0.5921 Mean : 984.5 Mean : 4.155
## 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 3.000
## Max. :83.0000 Max. :463881.0 Max. :99.000
## NA's :697 NA's :7622 NA's :990
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. : 0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 3121
## Median : 0.0000 Median : 0.000 Median : 8549
## Mean : 0.3126 Mean : 0.015 Mean : 17599
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 19521
## Max. :38.0000 Max. :20.000 Max. :1435667
## NA's :697 NA's :7604 NA's :7604
## BankcardUtilization AvailableBankcardCredit TotalTrades
## Min. :0.000 Min. : 0 Min. : 0.00
## 1st Qu.:0.310 1st Qu.: 880 1st Qu.: 15.00
## Median :0.600 Median : 4100 Median : 22.00
## Mean :0.561 Mean : 11210 Mean : 23.23
## 3rd Qu.:0.840 3rd Qu.: 13180 3rd Qu.: 30.00
## Max. :5.950 Max. :646285 Max. :126.00
## NA's :7604 NA's :7544 NA's :7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## Min. :0.000 Min. : 0.000
## 1st Qu.:0.820 1st Qu.: 0.000
## Median :0.940 Median : 0.000
## Mean :0.886 Mean : 0.802
## 3rd Qu.:1.000 3rd Qu.: 1.000
## Max. :1.000 Max. :20.000
## NA's :7544 NA's :7544
## DebtToIncomeRatio IncomeRange IncomeVerifiable
## Min. : 0.000 $25,000-49,999:32192 False: 8669
## 1st Qu.: 0.140 $50,000-74,999:31050 True :105268
## Median : 0.220 $100,000+ :17337
## Mean : 0.276 $75,000-99,999:16916
## 3rd Qu.: 0.320 Not displayed : 7741
## Max. :10.010 $1-24,999 : 7274
## NA's :8554 (Other) : 1427
## StatedMonthlyIncome LoanKey TotalProsperLoans
## Min. : 0 CB1B37030986463208432A1: 6 Min. :0.00
## 1st Qu.: 3200 2DEE3698211017519D7333F: 4 1st Qu.:1.00
## Median : 4667 9F4B37043517554537C364C: 4 Median :1.00
## Mean : 5608 D895370150591392337ED6D: 4 Mean :1.42
## 3rd Qu.: 6825 E6FB37073953690388BC56D: 4 3rd Qu.:2.00
## Max. :1750003 0D8F37036734373301ED419: 3 Max. :8.00
## (Other) :113912 NA's :91852
## TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 9.00
## Median : 16.00 Median : 15.00
## Mean : 22.93 Mean : 22.27
## 3rd Qu.: 33.00 3rd Qu.: 32.00
## Max. :141.00 Max. :141.00
## NA's :91852 NA's :91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 0.61 Mean : 0.05
## 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :42.00 Max. :21.00
## NA's :91852 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 0 Min. : 0
## 1st Qu.: 3500 1st Qu.: 0
## Median : 6000 Median : 1627
## Mean : 8472 Mean : 2930
## 3rd Qu.:11000 3rd Qu.: 4127
## Max. :72499 Max. :23451
## NA's :91852 NA's :91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-209.00 Min. : 0.0
## 1st Qu.: -35.00 1st Qu.: 0.0
## Median : -3.00 Median : 0.0
## Mean : -3.22 Mean : 152.8
## 3rd Qu.: 25.00 3rd Qu.: 0.0
## Max. : 286.00 Max. :2704.0
## NA's :95009
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 0.0 Min. : 1
## 1st Qu.: 9.00 1st Qu.: 6.0 1st Qu.: 37332
## Median :14.00 Median : 21.0 Median : 68599
## Mean :16.27 Mean : 31.9 Mean : 69444
## 3rd Qu.:22.00 3rd Qu.: 65.0 3rd Qu.:101901
## Max. :44.00 Max. :100.0 Max. :136486
## NA's :96985
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 2014-01-22 00:00:00: 491 Q4 2013:14450
## 1st Qu.: 4000 2013-11-13 00:00:00: 490 Q1 2014:12172
## Median : 6500 2014-02-19 00:00:00: 439 Q3 2013: 9180
## Mean : 8337 2013-10-16 00:00:00: 434 Q2 2013: 7099
## 3rd Qu.:12000 2014-01-28 00:00:00: 339 Q3 2012: 5632
## Max. :35000 2013-09-24 00:00:00: 316 Q2 2012: 5061
## (Other) :111428 (Other):60343
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 63CA34120866140639431C9: 9 Min. : 0.0 Min. : -2.35
## 16083364744933457E57FB9: 8 1st Qu.: 131.6 1st Qu.: 1005.76
## 3A2F3380477699707C81385: 8 Median : 217.7 Median : 2583.83
## 4D9C3403302047712AD0CDD: 8 Mean : 272.5 Mean : 4183.08
## 739C338135235294782AE75: 8 3rd Qu.: 371.6 3rd Qu.: 5548.40
## 7E1733653050264822FAA3D: 8 Max. :2251.5 Max. :40702.39
## (Other) :113888
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : -2.35 Min. :-664.87
## 1st Qu.: 500.9 1st Qu.: 274.87 1st Qu.: -73.18
## Median : 1587.5 Median : 700.84 Median : -34.44
## Mean : 3105.5 Mean : 1077.54 Mean : -54.73
## 3rd Qu.: 4000.0 3rd Qu.: 1458.54 3rd Qu.: -13.92
## Max. :35000.0 Max. :15617.03 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-9274.75 Min. : -94.2 Min. : -954.5
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -14.24 Mean : 700.4 Mean : 681.4
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.7000 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.:1.0000 1st Qu.: 0.00000
## Median : 0.00 Median :1.0000 Median : 0.00000
## Mean : 25.14 Mean :0.9986 Mean : 0.04803
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.: 0.00000
## Max. :21117.90 Max. :1.0125 Max. :39.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 0.00000 Median : 0.00 Median : 44.00
## Mean : 0.02346 Mean : 16.55 Mean : 80.48
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.: 115.00
## Max. :33.00000 Max. :25000.00 Max. :1189.00
##
Why are there so many NA values? Several variables share exactly the same NA count (for example, 29084 or 91852), so what causes the missing values? This is confusing, because some of these fields, e.g. EstimatedEffectiveYield, should be calculated automatically.
Why are there duplicate ListingKey values? Subset the rows for one duplicated ListingKey as an example.
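One way to investigate is to count the NA values per column and cross-tabulate missingness against the origination date. The variable descriptions say that several fields are only "applicable for loans originated after July 2009" and that the prior-loan fields are null when the borrower had no prior loans, which may account for the shared NA counts. The sketch below uses a small made-up data frame (`loans`), and the exact cut-off date `2009-07-01` is an assumption.

```r
# Toy stand-in for the Prosper data (hypothetical values, not the real file).
loans <- data.frame(
  LoanOriginationDate = as.Date(c("2007-03-01", "2008-11-15",
                                  "2010-05-20", "2013-10-16")),
  EstimatedEffectiveYield = c(NA, NA, 0.0796, 0.1264),
  ProsperScore = c(NA, NA, 7, 4),
  TotalProsperLoans = c(NA, NA, NA, 1)
)

# Count missing values per column, largest first.
na_counts <- sort(colSums(is.na(loans)), decreasing = TRUE)
print(na_counts)

# Cross-tabulate missingness against the assumed July 2009 cut-off.
pre_2009 <- loans$LoanOriginationDate < as.Date("2009-07-01")
na_by_period <- table(
  pre_july_2009 = pre_2009,
  yield_missing = is.na(loans$EstimatedEffectiveYield)
)
print(na_by_period)
```

If the missing values on the real data fall almost entirely in the pre-cut-off row, the NAs are structural (the field did not exist yet) rather than a data-quality problem.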
## ListingKey ListingNumber ListingCreationDate
## 13079 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000
## 14889 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000
## 20570 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000
## 31451 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000
## 42751 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000
## 42752 17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000
## CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate
## 13079 60 Current 0.16662 0.1435
## 14889 60 Current 0.16662 0.1435
## 20570 60 Current 0.16662 0.1435
## 31451 60 Current 0.16662 0.1435
## 42751 60 Current 0.16662 0.1435
## 42752 60 Current 0.16662 0.1435
## LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## 13079 0.1335 0.1264 0.0524 0.074
## 14889 0.1335 0.1264 0.0524 0.074
## 20570 0.1335 0.1264 0.0524 0.074
## 31451 0.1335 0.1264 0.0524 0.074
## 42751 0.1335 0.1264 0.0524 0.074
## 42752 0.1335 0.1264 0.0524 0.074
## ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## 13079 5 B 4
## 14889 5 B 8
## 20570 5 B 7
## 31451 5 B 10
## 42751 5 B 5
## 42752 5 B 6
## ListingCategory..numeric. BorrowerState Occupation EmploymentStatus
## 13079 1 MD Other Employed
## 14889 1 MD Other Employed
## 20570 1 MD Other Employed
## 31451 1 MD Other Employed
## 42751 1 MD Other Employed
## 42752 1 MD Other Employed
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## 13079 26 False False
## 14889 26 False False
## 20570 26 False False
## 31451 26 False False
## 42751 26 False False
## 42752 26 False False
## GroupKey DateCreditPulled CreditScoreRangeLower
## 13079 2013-12-23 09:38:12 720
## 14889 2013-12-23 09:38:12 720
## 20570 2013-12-23 09:38:12 720
## 31451 2013-12-23 09:38:12 720
## 42751 2013-12-23 09:38:12 720
## 42752 2013-12-23 09:38:12 720
## CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines
## 13079 739 1986-12-26 00:00:00 12
## 14889 739 1986-12-26 00:00:00 12
## 20570 739 1986-12-26 00:00:00 12
## 31451 739 1986-12-26 00:00:00 12
## 42751 739 1986-12-26 00:00:00 12
## 42752 739 1986-12-26 00:00:00 12
## OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts
## 13079 12 20 6
## 14889 12 20 6
## 20570 12 20 6
## 31451 12 20 6
## 42751 12 20 6
## 42752 12 20 6
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## 13079 348 0 5
## 14889 348 0 5
## 20570 348 0 5
## 31451 348 0 5
## 42751 348 0 5
## 42752 348 0 5
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## 13079 0 0 0
## 14889 0 0 0
## 20570 0 0 0
## 31451 0 0 0
## 42751 0 0 0
## 42752 0 0 0
## PublicRecordsLast10Years PublicRecordsLast12Months
## 13079 0 0
## 14889 0 0
## 20570 0 0
## 31451 0 0
## 42751 0 0
## 42752 0 0
## RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit
## 13079 14635 0.57 10865
## 14889 14635 0.57 10865
## 20570 14635 0.57 10865
## 31451 14635 0.57 10865
## 42751 14635 0.57 10865
## 42752 14635 0.57 10865
## TotalTrades TradesNeverDelinquent..percentage.
## 13079 17 1
## 14889 17 1
## 20570 17 1
## 31451 17 1
## 42751 17 1
## 42752 17 1
## TradesOpenedLast6Months DebtToIncomeRatio IncomeRange
## 13079 0 0.41 $25,000-49,999
## 14889 0 0.41 $25,000-49,999
## 20570 0 0.41 $25,000-49,999
## 31451 0 0.41 $25,000-49,999
## 42751 0 0.41 $25,000-49,999
## 42752 0 0.41 $25,000-49,999
## IncomeVerifiable StatedMonthlyIncome LoanKey
## 13079 True 3000 CB1B37030986463208432A1
## 14889 True 3000 CB1B37030986463208432A1
## 20570 True 3000 CB1B37030986463208432A1
## 31451 True 3000 CB1B37030986463208432A1
## 42751 True 3000 CB1B37030986463208432A1
## 42752 True 3000 CB1B37030986463208432A1
## TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments
## 13079 NA NA NA
## 14889 NA NA NA
## 20570 NA NA NA
## 31451 NA NA NA
## 42751 NA NA NA
## 42752 NA NA NA
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 13079 NA NA
## 14889 NA NA
## 20570 NA NA
## 31451 NA NA
## 42751 NA NA
## 42752 NA NA
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 13079 NA NA
## 14889 NA NA
## 20570 NA NA
## 31451 NA NA
## 42751 NA NA
## 42752 NA NA
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## 13079 NA 0
## 14889 NA 0
## 20570 NA 0
## 31451 NA 0
## 42751 NA 0
## 42752 NA 0
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## 13079 NA 2 126059
## 14889 NA 2 126059
## 20570 NA 2 126059
## 31451 NA 2 126059
## 42751 NA 2 126059
## 42752 NA 2 126059
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## 13079 10000 2014-01-13 00:00:00 Q1 2014
## 14889 10000 2014-01-13 00:00:00 Q1 2014
## 20570 10000 2014-01-13 00:00:00 Q1 2014
## 31451 10000 2014-01-13 00:00:00 Q1 2014
## 42751 10000 2014-01-13 00:00:00 Q1 2014
## 42752 10000 2014-01-13 00:00:00 Q1 2014
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 13079 F80D3694083622957BA09F2 234.5 234.5
## 14889 F80D3694083622957BA09F2 234.5 234.5
## 20570 F80D3694083622957BA09F2 234.5 234.5
## 31451 F80D3694083622957BA09F2 234.5 234.5
## 42751 F80D3694083622957BA09F2 234.5 234.5
## 42752 F80D3694083622957BA09F2 234.5 234.5
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## 13079 112.62 121.88 -8.49
## 14889 112.62 121.88 -8.49
## 20570 112.62 121.88 -8.49
## 31451 112.62 121.88 -8.49
## 42751 112.62 121.88 -8.49
## 42752 112.62 121.88 -8.49
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## 13079 0 0 0
## 14889 0 0 0
## 20570 0 0 0
## 31451 0 0 0
## 42751 0 0 0
## 42752 0 0 0
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## 13079 0 1 0
## 14889 0 1 0
## 20570 0 1 0
## 31451 0 1 0
## 42751 0 1 0
## 42752 0 1 0
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## 13079 0 0 96
## 14889 0 0 96
## 20570 0 0 96
## 31451 0 0 96
## 42751 0 0 96
## 42752 0 0 96
The only difference between these rows is ProsperScore. It is not clear to me what causes the ProsperScore to change. It looks as if a loan record is saved again each time its ProsperScore changes, but I could not confirm this.
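One way to check which columns actually differ between repeated records is to group the rows by a key and compare the values per column. The following is a minimal Python sketch, not the code used for this report. The helper name `differing_fields` is hypothetical; LoanKey and LoanNumber are taken from the output above, while the ProsperScore values and the second key are made up for the example.

```python
from collections import defaultdict

def differing_fields(rows, key):
    """Group rows by `key`; for every key that occurs more than once,
    return the set of column names whose values are not identical."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    result = {}
    for k, members in groups.items():
        if len(members) < 2:
            continue  # not a repeated record
        columns = members[0].keys()
        result[k] = {c for c in columns if len({m[c] for m in members}) > 1}
    return result

# LoanKey and LoanNumber come from the output above;
# the ProsperScore values and the second key are invented for illustration.
rows = [
    {"LoanKey": "CB1B37030986463208432A1", "LoanNumber": 126059, "ProsperScore": 4},
    {"LoanKey": "CB1B37030986463208432A1", "LoanNumber": 126059, "ProsperScore": 6},
    {"LoanKey": "HYPOTHETICAL-KEY", "LoanNumber": 1, "ProsperScore": 7},
]
diffs = differing_fields(rows, "LoanKey")
print(diffs)
```

If the result for a repeated key contains only ProsperScore, the rows are otherwise exact copies and can be collapsed before modelling.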
The main objective of this article is to find which factors impact BorrowerAPR, so we first look at the distribution of BorrowerAPR.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
The most frequent BorrowerAPR is around 0.2. Another peak is around 0.36.
Next we explore the variables one by one.
## AA A B C D E NA's
## 84984 3509 3315 4389 5649 5153 3289 3649
Among the loans that have a CreditGrade, C is the most frequent grade. Many loan records have no CreditGrade at all, which is reasonable because CreditGrade was only used to assess loans before July 2009; after July 2009, ProsperRating is used for each loan. CreditGrade or ProsperRating should be an important factor for the APR. There is no ‘HR’ level in CreditGrade; are the NA values the ‘HR’ level?
Something is wrong here: the NA values should be ‘HR’, so change NA to ‘HR’.
## AA A B C D E HR
## 84984 3509 3315 4389 5649 5153 3289 3649
## 12 36 60
## 1614 87778 24545
I suspected that the 12-month term had been cancelled for recent Prosper loans; however, the data shows that listings with this term were created not long ago, so this guess is wrong.
There are just three values for the Term variable (in months); the most frequent one is 36, i.e. three years.
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
LoanStatus is not a feature we care about for BorrowerAPR prediction; however, it could be used to predict which kinds of loans will be charged off. This status surprised me: the share of defaulted and charged-off loans is not small (17010 of 113937 loans, about 15%).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
This feature is highly related to BorrowerAPR: BorrowerAPR = BorrowerRate + origination fee. We will check whether the origination fee changes with the credit grade; judging from Prosper's own introduction, it seems so.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 3.000 4.000 4.072 5.000 7.000 29084
##
## 1 2 3 4 5 6 7
## 6935 9795 14274 18345 15581 14551 5372
The numeric and alpha values describe the same thing, so we can keep just one of them.
## AA A B C D E NA's
## 29084 5372 14551 15581 18345 14274 9795 6935
Something is wrong here: the NA values should be ‘HR’, so change NA to ‘HR’.
## AA A B C D E HR
## 29084 5372 14551 15581 18345 14274 9795 6935
As discussed before, combine the two columns CreditGrade and ProsperRating into one column, CreditRating, that describes the credit quality.
## A AA B C D E HR
## 131 17866 8881 19970 23994 19427 13084 10584
There are still 131 loans with no CreditRating information.
## AA A B C D E HR
## 131 8881 17866 19970 23994 19427 13084 10584
All the credit information is now combined, ordered from best to worst: ‘AA’ -> ‘HR’.
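The combination step can be sketched as a simple "take whichever column is filled in" rule. This is a minimal Python illustration, not the code behind the report; the function name `credit_rating` is hypothetical, and the counts are copied from the tables above, with NA already relabelled as HR and the first, unlabelled count of each table read as the blank level.

```python
# Rating order from best to worst, as stated in the text.
ORDER = ["AA", "A", "B", "C", "D", "E", "HR"]

def credit_rating(credit_grade, prosper_rating):
    """Return whichever of the two ratings is filled in, or None if both are blank."""
    for value in (credit_grade, prosper_rating):
        if value:  # empty strings and None count as missing
            return value
    return None

# Per-level counts copied from the CreditGrade and ProsperRating tables above.
grade_counts = {"AA": 3509, "A": 3315, "B": 4389, "C": 5649,
                "D": 5153, "E": 3289, "HR": 3649}
rating_counts = {"AA": 5372, "A": 14551, "B": 15581, "C": 18345,
                 "D": 14274, "E": 9795, "HR": 6935}

# If every loan carries at most one of the two ratings, the combined counts
# are simply the per-level sums.
combined = {level: grade_counts[level] + rating_counts[level] for level in ORDER}
missing = 113937 - sum(combined.values())
print(combined, missing)
```

The per-level sums reproduce the combined table, and the remainder equals the 131 loans without any rating, which suggests that no loan has both columns filled in.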
## 1 2 3 4 5 6 7 8 9 10 11 NA's
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456 29084
Why is there an 11? In the data description file, 10 should be the highest value. In any case, the order from best to worst is 11 -> 1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 2.774 3.000 20.000
The most frequent listing category is debt consolidation.
## AK AL AR AZ CA CO CT DC DE FL GA
## 5515 200 1679 855 1901 14717 2210 1627 382 300 6720 5008
## HI IA ID IL IN KS KY LA MA MD ME MI
## 409 186 599 5921 2078 1062 983 954 2242 2821 101 3593
## MN MO MS MT NC ND NE NH NJ NM NV NY
## 2318 2615 787 330 3084 52 674 551 3097 472 1090 6729
## OH OK OR PA RI SC SD TN TX UT VA VT
## 4197 971 1817 2972 435 1122 189 1737 6842 877 3278 207
## WA WI WV WY
## 3048 1842 391 150
The two-letter codes are US state postal abbreviations (for example, CA is California, which has the most loans here).
## Accountant/CPA
## 3588 3233
## Administrative Assistant Analyst
## 3688 3602
## Architect Attorney
## 213 1046
## Biologist Bus Driver
## 125 316
## Car Dealer Chemist
## 180 145
## Civil Service Clergy
## 1457 196
## Clerical Computer Programmer
## 3164 4478
## Construction Dentist
## 1790 68
## Doctor Engineer - Chemical
## 494 225
## Engineer - Electrical Engineer - Mechanical
## 1125 1406
## Executive Fireman
## 4311 422
## Flight Attendant Food Service
## 123 1123
## Food Service Management Homemaker
## 1239 120
## Investor Judge
## 214 22
## Laborer Landscaping
## 1595 236
## Medical Technician Military Enlisted
## 1117 1272
## Military Officer Nurse (LPN)
## 346 492
## Nurse (RN) Nurse's Aide
## 2489 491
## Other Pharmacist
## 28617 257
## Pilot - Private/Commercial Police Officer/Correction Officer
## 199 1578
## Postal Service Principal
## 627 312
## Professional Professor
## 13628 557
## Psychologist Realtor
## 145 543
## Religious Retail Management
## 124 2602
## Sales - Commission Sales - Retail
## 3446 2797
## Scientist Skilled Labor
## 372 2746
## Social Worker Student - College Freshman
## 741 41
## Student - College Graduate Student Student - College Junior
## 245 112
## Student - College Senior Student - College Sophomore
## 188 69
## Student - Community College Student - Technical School
## 28 16
## Teacher Teacher's Aide
## 3759 276
## Tradesman - Carpenter Tradesman - Electrician
## 120 477
## Tradesman - Mechanic Tradesman - Plumber
## 951 102
## Truck Driver Waiter/Waitress
## 1675 436
Occupation should not be a key feature.
## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
We want to combine the levels into just two: employed and not employed.
## Employed Not employed
## 113102 835
Most borrowers are employed.
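The two-level table above is reproduced if every status except ‘Not employed’ is mapped to ‘Employed’. The following Python sketch illustrates that mapping; the function name `employment_flag` is hypothetical, and the first, unlabelled count in the EmploymentStatus table is read as the blank status, which is an assumption.

```python
from collections import Counter

# Counts copied from the EmploymentStatus table above
# (the first, unlabelled count is assumed to be the blank status).
status_counts = {
    "": 2255, "Employed": 67322, "Full-time": 26355, "Not available": 5347,
    "Not employed": 835, "Other": 3806, "Part-time": 1088,
    "Retired": 795, "Self-employed": 6134,
}

def employment_flag(status):
    """Map an EmploymentStatus value to one of two levels."""
    return "Not employed" if status == "Not employed" else "Employed"

flag_counts = Counter()
for status, n in status_counts.items():
    flag_counts[employment_flag(status)] += n
print(dict(flag_counts))
```

Note that under this rule blank, ‘Not available’, ‘Other’ and ‘Retired’ borrowers are all counted as employed, which may blur the flag.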
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.07 137.00 755.00 7625
## False True
## 56459 57478
False and True are each close to 50%.
CreditScoreRangeLower and CreditScoreRangeUpper should be combined into one range column, like the income range.
## [0-19] [360-379] [420-439] [440-459] [460-479] [480-499] [500-519]
## 133 1 5 36 141 346 554
## [520-539] [540-559] [560-579] [580-599] [600-619] [620-639] [640-659]
## 1593 1474 1357 1125 3602 4172 12199
## [660-679] [680-699] [700-719] [720-739] [740-759] [760-779] [780-799]
## 16366 16492 15471 12923 9267 6606 4624
## [800-819] [820-839] [840-859] [860-879] [880-899] NA's
## 2644 1409 567 212 27 591
Most borrowers' credit scores are in the range 640 - 740. The upper value = lower value + 19, so we can keep just one of the two in the next revision.
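Because the upper bound is always the lower bound plus 19, the range label can be built from the lower bound alone. The Python sketch below shows this and also computes how large "most" is, using the counts from the table above; the function name `credit_score_range` is hypothetical.

```python
def credit_score_range(lower):
    """Build the range label from the lower bound (upper = lower + 19)."""
    return f"[{lower}-{lower + 19}]"

# Counts for the ranges with lower bounds 640 to 720, copied from the table above.
range_counts = {640: 12199, 660: 16366, 680: 16492, 700: 15471, 720: 12923}

total_with_score = 113937 - 591  # all loans minus the 591 NA's
share = sum(range_counts.values()) / total_with_score
print(credit_score_range(640), round(share, 3))
```

With these counts, roughly 65% of the loans that have a credit score fall between 640 and 739.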
Try to build a new feature with fewer CreditScoreRange levels, to check whether this improves the relationship.
Add one variable for the length of credit history; the longer the history, the lower the APR should be.
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The unit is days.
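A length-of-history variable in days can be obtained by subtracting the two timestamps. This is a minimal Python sketch under stated assumptions: the function name `length_history_days` is hypothetical, the timestamp format follows the `YYYY-MM-DD HH:MM:SS` values printed in the output, the first date is the FirstRecordedCreditLine value shown earlier, and the second date is only a stand-in for ListingCreationDate (it is the LoanOriginationDate of the same rows).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # format of the timestamps printed in the output

def length_history_days(first_credit_line, listing_created):
    """Days between the first recorded credit line and the listing date."""
    first = datetime.strptime(first_credit_line, FMT)
    created = datetime.strptime(listing_created, FMT)
    return (created - first).days

# First date: FirstRecordedCreditLine from the rows shown earlier.
# Second date: a stand-in for ListingCreationDate (assumption, see lead-in).
days = length_history_days("1986-12-26 00:00:00", "2014-01-13 00:00:00")
print(days, round(days / 365.25, 1))
```

Dividing by 365.25 converts the value to years, which may be easier to read on a plot axis.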
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.00 6.00 6.97 9.00 51.00
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 3506 4989 7557 9901 11315 11928 11545 10220 8705 7317 5875 4696
## 12 13 14 15 16 17 18 19 20 21 22 23
## 3678 2875 2277 1775 1297 1000 760 630 470 360 278 196
## 24 25 26 27 28 29 30 31 32 33 34 35
## 185 126 103 80 57 58 42 29 26 12 14 12
## 36 37 38 39 40 41 44 46 47 49 50 51
## 10 5 6 5 4 5 1 2 2 1 1 1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 1.000 1.435 2.000 105.000 697
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 50005 28621 14432 7697 4297 2610 1664 1014 696 508 372 275
## 12 13 14 15 16 17 18 19 20 21 22 23
## 207 163 128 98 79 64 53 33 30 40 22 18
## 24 25 26 27 28 29 30 31 32 33 34 35
## 16 14 14 8 8 6 4 10 4 2 4 4
## 36 37 38 40 41 42 44 46 50 52 53 63
## 1 3 2 3 1 1 2 1 1 1 1 1
## 97 105
## 1 1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.000 4.000 5.584 7.000 379.000 1159
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 8430 13785 14887 13934 12148 10098 7607 6171 4692 3779 2914 2431
## 12 13 14 15 16 17 18 19 20 21 22 23
## 1786 1453 1245 978 864 724 581 539 428 372 347 301
## 24 25 26 27 28 29 30 31 32 33 34 35
## 231 205 198 176 146 113 120 90 104 83 58 69
## 36 37 38 39 40 41 42 43 44 45 46 47
## 65 50 51 42 38 32 32 27 29 20 32 25
## 48 49 50 51 52 53 54 55 56 57 58 59
## 16 15 15 14 14 10 9 9 11 5 12 3
## 60 61 62 63 64 65 66 67 68 69 70 71
## 4 9 8 7 6 6 7 4 5 1 6 5
## 72 74 75 76 77 78 79 80 82 83 85 86
## 1 4 1 1 2 3 2 1 2 1 3 1
## 87 88 89 90 93 95 96 97 103 105 106 109
## 2 1 1 3 2 1 2 2 1 1 1 2
## 112 113 117 158 377 379
## 1 1 1 1 1 1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.5921 0.0000 83.0000 697
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 89742 11716 4357 2098 1379 916 690 517 397 289 212 191
## 12 13 14 15 16 17 18 19 20 21 22 23
## 147 111 71 83 58 40 37 28 27 31 21 9
## 24 25 26 27 28 30 31 32 33 35 36 37
## 12 5 8 12 5 2 6 5 1 2 2 1
## 39 40 41 45 50 51 57 59 64 82 83
## 1 1 2 1 1 1 1 1 1 1 1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 984.5 0.0 463881.0 7622
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 76439 3967 2879 3183 2592 1826 1790 1648 1421 1208 1151 1075
## 12 13 14 15 16 17 18 19 20 21 22 23
## 982 873 821 795 731 608 574 540 565 472 421 439
## 24 25 26 27 28 29 30 31 32 33 34 35
## 423 347 330 317 296 287 248 214 225 190 190 201
## 36 37 38 39 40 41 42 43 44 45 46 47
## 147 153 144 148 113 106 128 101 110 81 90 94
## 48 49 50 51 52 53 54 55 56 57 58 59
## 78 74 72 72 55 40 40 39 53 30 31 34
## 60 61 62 63 64 65 66 67 68 69 70 71
## 41 34 36 31 28 34 27 22 20 20 15 13
## 72 73 74 75 76 77 78 79 80 81 82 83
## 14 17 9 22 10 15 10 8 12 4 12 6
## 84 85 86 87 88 89 90 91 92 93 94 95
## 8 3 7 7 9 5 7 4 6 2 3 4
## 96 97 98 99
## 4 4 3 110
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.3126 0.0000 38.0000 697
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 85803 22834 3011 894 345 151 70 46 31 15 8 7
## 12 13 14 15 16 17 20 21 22 25 30 34
## 4 1 4 3 5 1 1 1 1 1 1 1
## 38
## 1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.015 0.000 20.000 7604
##
## 0 1 2 3 4 7 20
## 104941 1255 96 28 10 2 1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3121 8549 17599 19521 1435667 7604
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.310 0.600 0.561 0.840 5.950 7604
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 880 4100 11210 13180 646285 7544
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.820 0.940 0.886 1.000 1.000 7544
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
What is the meaning of ‘Not displayed’, and what is the difference between ‘Not employed’ and ‘$0’?
## False True
## 8669 105268
There are 113937 observations of 81 variables in this dataset. Each observation is one loan record. The dataset was collected from Prosper (WebBank), America’s first peer-to-peer lending marketplace. Borrowers request personal loans and investors fund them. Knowing more of the background makes the data easier to understand.
Since this is a loan dataset, what we care about most is BorrowerAPR. We want to know which factors impact BorrowerAPR and build a prediction model. The basic factor should be CreditRating; however, some combination of other variables should also be used to build the prediction model.
What other features in the dataset will help support the investigation into your feature(s) of interest?
In my view, the features that may impact BorrowerAPR include at least: ProsperScore, EmploymentStatus, EmploymentFlag, IsBorrowerHomeowner, CreditScoreRange, OpenRevolvingAccounts, InquiriesLast6Months, CurrentDelinquencies, AmountDelinquent, DelinquenciesLast7Years, PublicRecordsLast10Years, PublicRecordsLast12Months, BankcardUtilization, AvailableBankcardCredit, TradesNeverDelinquent, DebtToIncomeRatio, IncomeRange, IncomeVerifiable, LengthHistory etc.
After a lot of research, I found that a credit score has five important components: payment history, credit utilization, length of credit history, new credit and credit mix. Almost all features in the dataset relate to these five components. Apart from CreditRating, ProsperScore and CreditScoreRange, which already take account of the whole history, the delinquency features, the public-records features, BankcardUtilization, AvailableBankcardCredit, DebtToIncomeRatio and LengthHistory should be the most related features.
Since CreditGrade is used for the period before July 2009 and ProsperRating for the period after July 2009, I combined them into one CreditRating feature.
Created a variable EmploymentFlag that classifies the employment status as employed or not employed; this is checked further in the Bivariate Plots section.
Created a variable CreditScoreRange to describe CreditScoreRangeLower and CreditScoreRangeUpper more directly, and then a variable CreditScoreRevision to reduce the number of levels.
Created a variable LengthHistory for the length of credit history.
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?
CreditGrade and ProsperRating are missing the ‘HR’ level, so the NA values were changed to ‘HR’, since ProsperRating (numeric) uses 1 for ‘HR’.
Changed Term and ProsperScore to factors, since they have only a limited number of levels.
Changed FirstRecordedCreditLine and ListingCreationDate to datetime.
Select the variables we care about most to get a subset and calculate the correlations between them.
Correlation matrix
## BA BR CR T PS
## BA 1.000000000 0.989823970 0.87189714 -0.011183469 -0.668287197
## BR 0.989823970 1.000000000 0.87562059 0.020085366 -0.649736144
## CR 0.871897136 0.875620587 1.00000000 -0.070009142 -0.705221420
## T -0.011183469 0.020085366 -0.07000914 1.000000000 0.028946520
## PS -0.668287197 -0.649736144 -0.70522142 0.028946520 1.000000000
## LC 0.132455835 0.102913488 0.05017470 0.004947144 -0.009717877
## EF 0.060873776 0.058930969 0.04963288 -0.018943759 -0.023713085
## IB -0.132822618 -0.134430562 -0.19500074 0.085339314 0.064437777
## CS -0.451327562 -0.483434065 -0.65953775 0.129818307 0.369603047
## TC 0.002513417 -0.005793111 -0.05396318 0.076527775 -0.037251547
## IL 0.146119308 0.183810023 0.21623650 -0.113568019 -0.296761859
## TI 0.114546407 0.153128748 0.18666351 -0.103132404 -0.215766181
## CD 0.149403936 0.176530084 0.24818167 -0.083807367 -0.100612243
## AD 0.065679195 0.065644674 0.06419472 -0.016458723 -0.041600592
## DL 0.162225391 0.170278704 0.19500056 -0.041492379 -0.097754738
## PR 0.044094991 0.051168539 0.05445116 -0.026251691 -0.014884521
## BU 0.261438040 0.255482029 0.29675023 0.031535353 -0.244695570
## AB -0.348926135 -0.343861143 -0.38814023 0.015347737 0.318558038
## TT -0.041893875 -0.048210682 -0.08578411 0.079650347 -0.011130754
## TN -0.241348883 -0.261189459 -0.31617912 0.119341932 0.129893694
## DI 0.056327417 0.062916780 0.05011206 -0.014670053 -0.145335892
## IR -0.055260019 -0.033849986 0.02670477 -0.015797041 0.020201444
## IV -0.109974504 -0.099540159 -0.06735857 0.040402465 0.154618878
## LH -0.028707233 -0.052822826 -0.11367644 0.103460569 0.021765547
## CSR -0.478930786 -0.494403160 -0.61788683 0.101562235 0.358797768
## LC EF IB CS TC
## BA 0.132455835 6.087378e-02 -0.1328226179 -0.45132756 0.002513417
## BR 0.102913488 5.893097e-02 -0.1344305618 -0.48343406 -0.005793111
## CR 0.050174703 4.963288e-02 -0.1950007397 -0.65953775 -0.053963183
## T 0.004947144 -1.894376e-02 0.0853393140 0.12981831 0.076527775
## PS -0.009717877 -2.371309e-02 0.0644377769 0.36960305 -0.037251547
## LC 1.000000000 3.193896e-02 -0.0382242873 0.10301591 -0.038400838
## EF 0.031938962 1.000000e+00 -0.0430626838 0.01024205 -0.041491813
## IB -0.038224287 -4.306268e-02 1.0000000000 0.30235721 0.293586654
## CS 0.103015914 1.024205e-02 0.3023572068 1.00000000 0.105728412
## TC -0.038400838 -4.149181e-02 0.2935866539 0.10572841 1.000000000
## IL -0.072643941 -2.007191e-02 0.0068929738 -0.26938598 0.072628893
## TI -0.091324751 -3.238113e-02 0.0665960662 -0.29323289 0.168401756
## CD -0.049935645 -7.477945e-03 -0.0554537080 -0.37388191 0.067600665
## AD 0.022202378 -2.616979e-03 0.0381222589 -0.06584937 0.050983528
## DL 0.016949523 -1.314943e-02 -0.0707983959 -0.25983065 0.146574191
## PR 0.003167448 2.847180e-04 -0.0150161212 -0.08344302 0.005523972
## BU -0.087582952 -2.507352e-02 0.0866408510 -0.40533805 0.100924234
## AB -0.031516945 -8.422820e-03 0.1420390221 0.45369414 0.194360344
## TT -0.065431345 -5.303084e-02 0.3174057629 0.14002810 0.936482440
## TN -0.022489056 8.756344e-05 0.1371216760 0.46866893 0.044293815
## DI -0.042754149 1.521460e-01 0.0001774271 -0.01499532 0.037486139
## IR -0.098098851 2.123003e-01 0.0266316683 -0.12687275 0.038153883
## IV -0.043461048 -2.551694e-01 0.0641075614 -0.06329716 0.052501724
## LH 0.028583154 -1.779636e-02 0.2007114714 0.22600401 0.367817571
## CSR 0.063586870 1.007847e-02 0.2949463342 0.91878992 0.074654387
## IL TI CD AD DL
## BA 0.146119308 0.11454641 0.1494039356 0.065679195 0.16222539
## BR 0.183810023 0.15312875 0.1765300836 0.065644674 0.17027870
## CR 0.216236497 0.18666351 0.2481816718 0.064194718 0.19500056
## T -0.113568019 -0.10313240 -0.0838073668 -0.016458723 -0.04149238
## PS -0.296761859 -0.21576618 -0.1006122425 -0.041600592 -0.09775474
## LC -0.072643941 -0.09132475 -0.0499356450 0.022202378 0.01694952
## EF -0.020071906 -0.03238113 -0.0074779453 -0.002616979 -0.01314943
## IB 0.006892974 0.06659607 -0.0554537080 0.038122259 -0.07079840
## CS -0.269385980 -0.29323289 -0.3738819082 -0.065849371 -0.25983065
## TC 0.072628893 0.16840176 0.0676006646 0.050983528 0.14657419
## IL 1.000000000 0.74194993 0.1563415408 0.023968896 0.09032870
## TI 0.741949925 1.00000000 0.1751286458 0.031494616 0.11135223
## CD 0.156341541 0.17512865 1.0000000000 0.340548522 0.37777692
## AD 0.023968896 0.03149462 0.3405485218 1.000000000 0.23327026
## DL 0.090328703 0.11135223 0.3777769220 0.233270261 1.00000000
## PR 0.048872572 0.05606020 0.1116605006 0.041158349 0.08385023
## BU -0.032599094 0.01788041 -0.0437730700 -0.024321355 -0.02948350
## AB -0.004564040 -0.01692251 -0.0924328516 -0.020285733 -0.13448634
## TT 0.075307472 0.17166533 -0.0002363269 0.031958692 0.09387175
## TN -0.119889136 -0.12797625 -0.4587605913 -0.138624920 -0.51644318
## DI 0.024435906 0.02856406 -0.0242645963 -0.019397486 -0.04387671
## IR 0.100565579 0.10171298 0.1230138455 0.005495840 0.05991164
## IV 0.045767194 0.05053806 0.0429200919 0.008977402 0.04120704
## LH -0.101054528 -0.10188945 -0.0218254427 0.042973961 0.08520953
## CSR -0.189599636 -0.22532376 -0.2345581052 -0.055521784 -0.23644920
## PR BU AB TT TN
## BA 0.044094991 0.26143804 -0.348926135 -0.0418938748 -2.413489e-01
## BR 0.051168539 0.25548203 -0.343861143 -0.0482106823 -2.611895e-01
## CR 0.054451158 0.29675023 -0.388140234 -0.0857841144 -3.161791e-01
## T -0.026251691 0.03153535 0.015347737 0.0796503467 1.193419e-01
## PS -0.014884521 -0.24469557 0.318558038 -0.0111307542 1.298937e-01
## LC 0.003167448 -0.08758295 -0.031516945 -0.0654313455 -2.248906e-02
## EF 0.000284718 -0.02507352 -0.008422820 -0.0530308417 8.756344e-05
## IB -0.015016121 0.08664085 0.142039022 0.3174057629 1.371217e-01
## CS -0.083443019 -0.40533805 0.453694143 0.1400280960 4.686689e-01
## TC 0.005523972 0.10092423 0.194360344 0.9364824396 4.429381e-02
## IL 0.048872572 -0.03259909 -0.004564040 0.0753074717 -1.198891e-01
## TI 0.056060197 0.01788041 -0.016922506 0.1716653315 -1.279762e-01
## CD 0.111660501 -0.04377307 -0.092432852 -0.0002363269 -4.587606e-01
## AD 0.041158349 -0.02432135 -0.020285733 0.0319586921 -1.386249e-01
## DL 0.083850226 -0.02948350 -0.134486344 0.0938717472 -5.164432e-01
## PR 1.000000000 -0.02120356 -0.027696759 -0.0086041890 -1.146543e-01
## BU -0.021203560 1.00000000 -0.350830600 0.1005900351 3.926058e-02
## AB -0.027696759 -0.35083060 1.000000000 0.2499170866 2.384296e-01
## TT -0.008604189 0.10059004 0.249917087 1.0000000000 1.220540e-01
## TN -0.114654296 0.03926058 0.238429612 0.1220539905 1.000000e+00
## DI -0.008150796 0.03559958 0.002058548 0.0393176800 5.463910e-02
## IR -0.008929898 0.03497872 -0.005458496 0.0944064561 2.189603e-02
## IV -0.005135338 0.04672484 -0.045080901 0.0584803707 -4.632649e-02
## LH 0.007866744 0.07993760 0.154917210 0.3981152488 6.083840e-03
## CSR -0.061944128 -0.41934418 0.471888435 0.1189872825 3.969301e-01
## DI IR IV LH CSR
## BA 5.632742e-02 -0.055260019 -0.109974504 -2.870723e-02 -0.47893079
## BR 6.291678e-02 -0.033849986 -0.099540159 -5.282283e-02 -0.49440316
## CR 5.011206e-02 0.026704772 -0.067358572 -1.136764e-01 -0.61788683
## T -1.467005e-02 -0.015797041 0.040402465 1.034606e-01 0.10156223
## PS -1.453359e-01 0.020201444 0.154618878 2.176555e-02 0.35879777
## LC -4.275415e-02 -0.098098851 -0.043461048 2.858315e-02 0.06358687
## EF 1.521460e-01 0.212300278 -0.255169351 -1.779636e-02 0.01007847
## IB 1.774271e-04 0.026631668 0.064107561 2.007115e-01 0.29494633
## CS -1.499532e-02 -0.126872751 -0.063297158 2.260040e-01 0.91878992
## TC 3.748614e-02 0.038153883 0.052501724 3.678176e-01 0.07465439
## IL 2.443591e-02 0.100565579 0.045767194 -1.010545e-01 -0.18959964
## TI 2.856406e-02 0.101712983 0.050538058 -1.018894e-01 -0.22532376
## CD -2.426460e-02 0.123013845 0.042920092 -2.182544e-02 -0.23455811
## AD -1.939749e-02 0.005495840 0.008977402 4.297396e-02 -0.05552178
## DL -4.387671e-02 0.059911645 0.041207043 8.520953e-02 -0.23644920
## PR -8.150796e-03 -0.008929898 -0.005135338 7.866744e-03 -0.06194413
## BU 3.559958e-02 0.034978717 0.046724845 7.993760e-02 -0.41934418
## AB 2.058548e-03 -0.005458496 -0.045080901 1.549172e-01 0.47188844
## TT 3.931768e-02 0.094406456 0.058480371 3.981152e-01 0.11898728
## TN 5.463910e-02 0.021896030 -0.046326486 6.083840e-03 0.39693013
## DI 1.000000e+00 -0.077113701 -0.600516568 -6.130953e-05 -0.02037398
## IR -7.711370e-02 1.000000000 0.068199507 -2.421596e-02 -0.06084069
## IV -6.005166e-01 0.068199507 1.000000000 -1.148076e-02 -0.05411182
## LH -6.130953e-05 -0.024215961 -0.011480757 1.000000e+00 0.18279835
## CSR -2.037398e-02 -0.060840692 -0.054111816 1.827983e-01 1.00000000
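Each cell of the matrix above is a pairwise correlation coefficient. As a reminder of what is being computed, here is a minimal Python sketch of the Pearson coefficient that skips pairs with a missing value. Whether the report used exactly this missing-value handling is an assumption, and the input values below are invented toy numbers, not rows from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)

# Toy values (made up): an integer-coded rating and an APR that rises with it.
rating_code = [1, 2, 3, 4, 5, None]          # the last pair is dropped
apr = [0.08, 0.12, 0.16, 0.20, 0.24, 0.30]
r = pearson(rating_code, apr)
print(round(r, 4))
```

Ordered categories such as CreditRating have to be turned into integer codes before they can enter a calculation like this, so the sign of a coefficient depends on how the levels were ordered.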
Select the relationships by the absolute value of the correlation: Very Strong (0.8 - 1), Strong (0.6 - 0.8), Moderate (0.4 - 0.6), Weak (0.2 - 0.4)
## Var1 Var2 value
## 1 BA BA 1.0000000
## 2 BR BA 0.9898240
## 3 CR BA 0.8718971
## 26 BA BR 0.9898240
## 27 BR BR 1.0000000
## 28 CR BR 0.8756206
## 51 BA CR 0.8718971
## 52 BR CR 0.8756206
## 53 CR CR 1.0000000
## 79 T T 1.0000000
## 105 PS PS 1.0000000
## 131 LC LC 1.0000000
## 157 EF EF 1.0000000
## 183 IB IB 1.0000000
## 209 CS CS 1.0000000
## 225 CSR CS 0.9187899
## 235 TC TC 1.0000000
## 244 TT TC 0.9364824
## 261 IL IL 1.0000000
## 287 TI TI 1.0000000
## 313 CD CD 1.0000000
## 339 AD AD 1.0000000
## 365 DL DL 1.0000000
## 391 PR PR 1.0000000
## 417 BU BU 1.0000000
## 443 AB AB 1.0000000
## 460 TC TT 0.9364824
## 469 TT TT 1.0000000
## 495 TN TN 1.0000000
## 521 DI DI 1.0000000
## 547 IR IR 1.0000000
## 573 IV IV 1.0000000
## 599 LH LH 1.0000000
## 609 CS CSR 0.9187899
## 625 CSR CSR 1.0000000
## Var1 Var2 value
## 5 PS BA -0.6682872
## 30 PS BR -0.6497361
## 55 PS CR -0.7052214
## 59 CS CR -0.6595377
## 75 CSR CR -0.6178868
## 101 BA PS -0.6682872
## 102 BR PS -0.6497361
## 103 CR PS -0.7052214
## 203 CR CS -0.6595377
## 262 TI IL 0.7419499
## 286 IL TI 0.7419499
## 523 IV DI -0.6005166
## 571 DI IV -0.6005166
## 603 CR CSR -0.6178868
## Var1 Var2 value
## 9 CS BA -0.4513276
## 25 CSR BA -0.4789308
## 34 CS BR -0.4834341
## 50 CSR BR -0.4944032
## 201 BA CS -0.4513276
## 202 BR CS -0.4834341
## 217 BU CS -0.4053380
## 218 AB CS 0.4536941
## 220 TN CS 0.4686689
## 320 TN CD -0.4587606
## 370 TN DL -0.5164432
## 409 CS BU -0.4053380
## 425 CSR BU -0.4193442
## 434 CS AB 0.4536941
## 450 CSR AB 0.4718884
## 484 CS TN 0.4686689
## 488 CD TN -0.4587606
## 490 DL TN -0.5164432
## 601 BA CSR -0.4789308
## 602 BR CSR -0.4944032
## 617 BU CSR -0.4193442
## 618 AB CSR 0.4718884
## Var1 Var2 value
## 17 BU BA 0.2614380
## 18 AB BA -0.3489261
## 20 TN BA -0.2413489
## 42 BU BR 0.2554820
## 43 AB BR -0.3438611
## 45 TN BR -0.2611895
## 61 IL CR 0.2162365
## 63 CD CR 0.2481817
## 67 BU CR 0.2967502
## 68 AB CR -0.3881402
## 70 TN CR -0.3161791
## 109 CS PS 0.3696030
## 111 IL PS -0.2967619
## 112 TI PS -0.2157662
## 117 BU PS -0.2446956
## 118 AB PS 0.3185580
## 125 CSR PS 0.3587978
## 172 IR EF 0.2123003
## 173 IV EF -0.2551694
## 184 CS IB 0.3023572
## 185 TC IB 0.2935867
## 194 TT IB 0.3174058
## 199 LH IB 0.2007115
## 200 CSR IB 0.2949463
## 205 PS CS 0.3696030
## 208 IB CS 0.3023572
## 211 IL CS -0.2693860
## 212 TI CS -0.2932329
## 213 CD CS -0.3738819
## 215 DL CS -0.2598307
## 224 LH CS 0.2260040
## 233 IB TC 0.2935867
## 249 LH TC 0.3678176
## 253 CR IL 0.2162365
## 255 PS IL -0.2967619
## 259 CS IL -0.2693860
## 280 PS TI -0.2157662
## 284 CS TI -0.2932329
## 300 CSR TI -0.2253238
## 303 CR CD 0.2481817
## 309 CS CD -0.3738819
## 314 AD CD 0.3405485
## 315 DL CD 0.3777769
## 325 CSR CD -0.2345581
## 338 CD AD 0.3405485
## 340 DL AD 0.2332703
## 359 CS DL -0.2598307
## 363 CD DL 0.3777769
## 364 AD DL 0.2332703
## 375 CSR DL -0.2364492
## 401 BA BU 0.2614380
## 402 BR BU 0.2554820
## 403 CR BU 0.2967502
## 405 PS BU -0.2446956
## 418 AB BU -0.3508306
## 426 BA AB -0.3489261
## 427 BR AB -0.3438611
## 428 CR AB -0.3881402
## 430 PS AB 0.3185580
## 442 BU AB -0.3508306
## 444 TT AB 0.2499171
## 445 TN AB 0.2384296
## 458 IB TT 0.3174058
## 468 AB TT 0.2499171
## 474 LH TT 0.3981152
## 476 BA TN -0.2413489
## 477 BR TN -0.2611895
## 478 CR TN -0.3161791
## 493 AB TN 0.2384296
## 500 CSR TN 0.3969301
## 532 EF IR 0.2123003
## 557 EF IV -0.2551694
## 583 IB LH 0.2007115
## 584 CS LH 0.2260040
## 585 TC LH 0.3678176
## 594 TT LH 0.3981152
## 605 PS CSR 0.3587978
## 608 IB CSR 0.2949463
## 612 TI CSR -0.2253238
## 613 CD CSR -0.2345581
## 615 DL CSR -0.2364492
## 620 TN CSR 0.3969301
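The four tables above group variable pairs by the strength of their correlation. The banding can be sketched in Python as follows; the function name `strength`, the label ‘negligible’ for values below 0.2 and the choice of which side of each boundary is inclusive are assumptions, while the coefficients are copied from the correlation matrix.

```python
def strength(r):
    """Classify a correlation by its absolute value (lower bounds inclusive)."""
    a = abs(r)
    if a >= 0.8:
        return "very strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "negligible"  # label not used in the report, added for completeness

# A few coefficients copied from the correlation matrix above.
pairs = {
    ("BA", "BR"): 0.9898240,
    ("BA", "CR"): 0.8718971,
    ("BA", "PS"): -0.6682872,
    ("BA", "CS"): -0.4513276,
    ("BA", "BU"): 0.2614380,
    ("BA", "LH"): -0.0287072,
}

by_band = {}
for pair, r in pairs.items():
    by_band.setdefault(strength(r), []).append(pair)
print(by_band)
```

Filtering out the pairs where both names are equal would also remove the trivial 1.0 entries on the diagonal from the first table.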
From the correlation figure, we can see a very strong relationship between BorrowerAPR and CreditRating (0.87). This meets our expectation: the better the CreditRating, the lower the BorrowerAPR should be.
Moreover, there is a strong relationship between BorrowerAPR and ProsperScore (-0.67), while CreditRating also has a strong relationship with ProsperScore (-0.71).
BorrowerAPR has a moderate relationship with CreditScoreRange (-0.45), while CreditRating has a strong one (-0.66). Interesting: why is the relationship with BorrowerAPR only moderate?
BorrowerAPR has weak relationships with BankcardUtilization (0.26), AvailableBankcardCredit (-0.35) and TradesNeverDelinquent (-0.24). Among themselves, these three are at most weakly related (BankcardUtilization and TradesNeverDelinquent are practically uncorrelated, 0.04). Moreover, each of them has a moderate relationship with CreditScoreRange.
InquiriesLast6Months has a strong relationship with TotalInquiries (0.74), which is reasonable.
Interesting: IncomeVerifiable has a strong relationship with DebtToIncomeRatio (-0.60). Why?
TradesNeverDelinquent has a moderate relationship with CurrentDelinquencies (-0.46) and DelinquenciesLast7Years (-0.52), which makes sense.
There are lots of weak relationships.
The value of the created variables is not that obvious. CreditScoreRevision improves a little on CreditScoreRange (-0.48 versus -0.45 against BorrowerAPR), so CreditScoreRevision will be used for the following analysis.
I have not found variables that are related to BorrowerAPR but not related to CreditRating.
One idea: we want to know whether the origination fee changes with CreditRating, so we build a variable OrginationFee and visualize it.
We can see that the origination fee also changes with the CreditRating level. The most obvious difference is between levels ‘AA’ and ‘A’.
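Following the relation BorrowerAPR = BorrowerRate + origination fee used earlier, the fee variable is the difference of the two columns, which can then be summarised per rating level. The Python sketch below is illustrative only: the function names are hypothetical, the two means come from the summaries above (the APR mean ignores 25 missing values, so the difference is approximate), and the per-loan rows are invented toy values, not results from the dataset.

```python
from collections import defaultdict
from statistics import median

def origination_fee(apr, rate):
    """Fee component implied by BorrowerAPR = BorrowerRate + fee."""
    return apr - rate

# Means taken from the BorrowerAPR and BorrowerRate summaries above.
mean_fee = origination_fee(0.21883, 0.1928)

def median_fee_by_rating(loans):
    """loans: iterable of (rating, apr, rate) tuples -> {rating: median fee}."""
    fees = defaultdict(list)
    for rating, apr, rate in loans:
        fees[rating].append(origination_fee(apr, rate))
    return {rating: median(values) for rating, values in fees.items()}

# Invented toy rows, only to show the shape of the computation.
toy_loans = [("AA", 0.070, 0.065), ("AA", 0.080, 0.075),
             ("A", 0.120, 0.100), ("A", 0.130, 0.110), ("A", 0.140, 0.120)]
fee_by_rating = median_fee_by_rating(toy_loans)
print(round(mean_fee, 4), fee_by_rating)
```

On the real data, the average difference between the two means is about 0.026, i.e. roughly 2.6 percentage points.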
The result meets our expectation, as mentioned before. However, why are there so many outliers, and why is the variance not small? It seems that other variables also drive BorrowerAPR, and their influence cannot be ignored.
Keep CreditRating fixed and check the influence of ProsperScore.
As the correlation calculated before shows, there is a strong relationship between BorrowerAPR and ProsperScore.
Plot the relationship between ProsperScore and CreditRating. It is still unclear to me how CreditRating is derived and how ProsperScore is calculated.
From the figure, we can see that some ‘AA’ borrowers still have a high-risk score, e.g., 4. Therefore, we should keep the ProsperScore feature; it describes a different dimension for BorrowerAPR prediction.
## [0-19] [360-379] [420-439] [440-459] [460-479] [480-499] [500-519]
## 133 1 5 36 141 346 554
## [520-539] [540-559] [560-579] [580-599] [600-619] [620-639] [640-659]
## 1593 1474 1357 1125 3602 4172 12199
## [660-679] [680-699] [700-719] [720-739] [740-759] [760-779] [780-799]
## 16366 16492 15471 12923 9267 6606 4624
## [800-819] [820-839] [840-859] [860-879] [880-899] NA's
## 2644 1409 567 212 27 591
The number of samples in both tails is not enough, e.g., [0 - 440]. The score comes from a consumer credit rating agency, which raises the same question again: how is the credit rating derived?
From the figure, we can see that CreditRating ‘AA’ borrowers may have a low credit score. Why? In any case, CreditScoreRange is still an important feature for the prediction.
Try to reduce the number of CreditScoreRange levels to check whether this improves the correlation.
## (0,640] (640,680] (680,720] (720,760] (760,800] (800,840] (840,880]
## 26605 32858 28394 15873 7268 1976 239
## NA's
## 724
This is clearer than CreditScoreRange.
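The reduced levels above are consistent with cutting the lower bound of the credit score range into right-closed intervals. The Python sketch below reproduces the table from the per-range counts shown earlier; the function name `credit_score_bin` is hypothetical, the break points are read off the interval labels, and binning on the lower bound is an inference from the counts rather than something the report states.

```python
from bisect import bisect_left
from collections import Counter

BREAKS = [0, 640, 680, 720, 760, 800, 840, 880]  # read off the labels above

def credit_score_bin(lower):
    """Right-closed bin (a, b] for a lower bound; None if missing or outside."""
    if lower is None or lower <= BREAKS[0] or lower > BREAKS[-1]:
        return None
    i = bisect_left(BREAKS, lower)  # index of the first break >= lower
    return f"({BREAKS[i - 1]},{BREAKS[i]}]"

# Counts per lower bound, copied from the CreditScoreRange table (None = NA's).
lower_counts = {
    0: 133, 360: 1, 420: 5, 440: 36, 460: 141, 480: 346, 500: 554,
    520: 1593, 540: 1474, 560: 1357, 580: 1125, 600: 3602, 620: 4172,
    640: 12199, 660: 16366, 680: 16492, 700: 15471, 720: 12923, 740: 9267,
    760: 6606, 780: 4624, 800: 2644, 820: 1409, 840: 567, 860: 212,
    880: 27, None: 591,
}

binned = Counter()
for lower, n in lower_counts.items():
    binned[credit_score_bin(lower)] += n
print(dict(binned))
```

Because the first interval is open on the left, the 133 loans with a lower bound of 0 drop out of every bin, which explains why the number of NA's rises from 591 to 724.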
If I understand correctly, BankcardUtilization is the ratio of credit card balances to credit limits. The higher the BankcardUtilization, the higher the BorrowerAPR, because a high BankcardUtilization makes lenders think there is an increased risk.
Based on the correlation calculated before, AvailableBankcardCredit should have a weak relationship with BorrowerAPR, and this figure shows it. As far as I know, credit limit = credit balance + pending transactions + available credit, so this feature reflects the credit limit indirectly.
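The relation between balance, available credit and utilization can be checked on the loan shown at the top of this section. This is a rough Python sketch with two assumptions: pending transactions are ignored, and RevolvingCreditBalance is used as the card balance, which need not hold for every borrower. The function name `bankcard_utilization` is hypothetical.

```python
def bankcard_utilization(balance, available_credit):
    """Share of the credit limit in use, with limit = balance + available credit."""
    limit = balance + available_credit
    return balance / limit if limit else None

# Values from the rows shown earlier:
# RevolvingCreditBalance = 14635, AvailableBankcardCredit = 10865.
utilization = bankcard_utilization(14635, 10865)
print(round(utilization, 2))
```

For this loan the result rounds to the BankcardUtilization of 0.57 reported in the data, so the reading of the variable as balance divided by limit looks plausible.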
The higher TradesNeverDelinquent..percentage., the lower the BorrowerAPR.
This should be clear now: employed borrowers may get a lower BorrowerAPR. But why is the correlation between BorrowerAPR and EmployedFlag low? If we look more carefully, the tail below 25% seems long.
The ‘$0’ level does not look reasonable, and there are not enough samples. What is the difference between ‘$0’ and ‘not employed’?
When IncomeVerifiable is false, the 50% quantile (median) of DebtToIncomeRatio is 10.1, which is also the maximum value of DebtToIncomeRatio. What does 10.1 mean? It seems that the strong relationship between DebtToIncomeRatio and IncomeVerifiable is meaningless and not useful for our objective.
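The 50% figure quoted above is a group median. A minimal sketch of that check on toy records follows; the values are illustrative and not taken from the dataset.

```python
from collections import defaultdict
from statistics import median

# Toy (IncomeVerifiable, DebtToIncomeRatio) pairs; None marks a missing ratio.
rows = [(True, 0.18), (True, 0.25), (True, 0.31),
        (False, 10.1), (False, 10.1), (False, 0.40), (False, None)]

by_flag = defaultdict(list)
for verifiable, ratio in rows:
    if ratio is not None:  # skip missing ratios
        by_flag[verifiable].append(ratio)

medians = {flag: median(vals) for flag, vals in by_flag.items()}
```

When the median of a group equals the maximum, at least half of that group sits at the maximum, which suggests the value acts as a ceiling rather than a measured ratio.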
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
There is a very strong relationship between BorrowerAPR and CreditRating, which meets our expectation: the higher the CreditRating, the lower the BorrowerAPR should be.
Moreover, there is a strong relationship between BorrowerAPR and ProsperScore, while CreditRating also has a strong relationship with ProsperScore. From the bar plot, some ‘AA’ borrowers still have a high-risk ProsperScore, e.g., 4. Therefore, we should still keep the ProsperScore feature; it describes a different dimension for BorrowerAPR prediction.
BorrowerAPR has a moderate relationship with CreditScoreRange, and CreditRating also has a moderate relationship with CreditScoreRange. Moreover, CreditRating ‘AA’ borrowers may have a low CreditScore. Why? Either way, CreditScoreRange is still a feature for the prediction.
BorrowerAPR has a weak relationship with BankcardUtilization, AvailableBankcardCredit and TradesNeverDelinquent. These three also have weak relationships with each other, and moderate relationships with CreditScoreRange.
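The labels "very strong", "strong", "moderate" and "weak" are applied to correlation coefficients. The sketch below computes a Pearson coefficient from scratch and maps its absolute value to such labels. The cut-offs (0.8, 0.6, 0.4, 0.2) are an assumed rule of thumb, not necessarily the ones used in this analysis, and the data is a toy example with an assumed ordinal encoding of the rating.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def strength(r):
    """Map |r| to a label; the thresholds are an assumed rule of thumb."""
    a = abs(r)
    if a >= 0.8:
        return "very strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "very weak"

# Toy data: rating encoded as 1 (worst) to 7 (best), APR falling as rating improves.
rating_rank = [1, 2, 3, 4, 5, 6, 7]
apr = [0.35, 0.31, 0.27, 0.22, 0.18, 0.13, 0.09]
r = pearson(rating_rank, apr)
label = strength(r)
```

The sign of `r` gives the direction (negative here, since a better rating goes with a lower APR), while the label depends only on the magnitude.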
What confuses me is where CreditRating and ProsperScore come from. All these features should combine the important credit information, such as payment history, credit utilization, length of credit history, new credit and credit mix, all of which are historical features.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
InquiriesLast6Months has a strong relationship with TotalInquiries, which is reasonable. TradesNeverDelinquent has a moderate relationship with CurrentDelinquencies and DelinquenciesLast7Years, which makes sense. All of this suggests that history can predict the present status.
IncomeVerifiable has a strong relationship with DebtToIncomeRatio, but when IncomeVerifiable is False, the median DebtToIncomeRatio equals the maximum value, 10.1. I do not know the meaning of this value, so this relationship is probably not useful for our objective.
BorrowerAPR and CreditRating
We know there is a weak relationship between CreditRating and BankcardUtilization, and the figure confirms it. Most visibly, ‘AA’ borrowers tend to have low BankcardUtilization, while ‘HR’ and ‘E’ borrowers tend to have high BankcardUtilization.
From the figure, we can see the weak relationship between TradesNeverDelinquent..percentage. and CreditRating: more ‘HR’ borrowers than ‘AA’ borrowers have a small TradesNeverDelinquent..percentage. Moreover, with CreditRating held fixed, a smaller TradesNeverDelinquent..percentage. tends to go with a higher BorrowerAPR.
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
BorrowerAPR has a very strong relationship with CreditRating, a strong relationship with ProsperScore, a moderate relationship with CreditScoreRevision, and weak relationships with BankcardUtilization, AvailableBankcardCredit and TradesNeverDelinquent. All of these features also have strong, moderate or weak relationships with CreditRating. However, digging deeper, each of them describes a different dimension of the data. I still wonder how the CreditRating value is obtained. Is it calculated from historical data, such as payment history, credit utilization, length of credit history, new credit and credit mix? How are ProsperScore and CreditScoreRange obtained, and how do they differ?
Lots of features have strong or moderate relationships with CreditRating. If all these features, i.e., CreditRating, ProsperScore and CreditScoreRevision, take historical data into account, why are they different, and why do we need them all?
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
I want to build a model to predict BorrowerAPR. I will do this in the future, possibly using machine learning algorithms.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
From this figure, we can see several peaks: the most frequent one is around 0.2, and another large peak is around 0.36.
The higher the CreditRating, the lower the BorrowerAPR. However, the variance within each CreditRating level is not small, so other important features still influence the final BorrowerAPR. From the trend, it seems that a linear model could be built to predict BorrowerAPR.
From the figure, we can see the weak relationship between TradesNeverDelinquent..percentage. and CreditRating: more ‘HR’ borrowers than ‘AA’ borrowers have a small TradesNeverDelinquent..percentage. Moreover, with CreditRating held fixed, a smaller TradesNeverDelinquent..percentage. tends to go with a higher BorrowerAPR.
This is a loan data set, and it took me a long time to understand each variable. That included reading the Excel data description and searching for background knowledge on credit loans, Prosper and WebBank.
Then I explored the data with univariate and bivariate plots and began to see what I could find in it. BorrowerAPR is what borrowers and lenders care about most. Borrowers want to know how to reduce their BorrowerAPR; lenders want to know which kinds of loans will earn them the most money and minimize their losses. I found a lot of information in the bivariate plots. Once the correlation matrix was calculated, I compared its values with the plots and finally understood what was going on.
To predict BorrowerAPR better, I created some new variables, e.g., CreditRating, CreditScoreRange, CreditScoreRevision, EmploymentFlag and LengthHistory. However, the effect is not that obvious: only the relationship between BorrowerAPR and CreditScoreRevision improves a little compared with the relationship between BorrowerAPR and CreditScoreRange.
BorrowerAPR has a very strong relationship with CreditRating, a strong relationship with ProsperScore, a moderate relationship with CreditScoreRevision, and weak relationships with BankcardUtilization, AvailableBankcardCredit and TradesNeverDelinquent. CreditRating in turn has a strong relationship with ProsperScore and a moderate relationship with CreditScoreRevision. However, ProsperScore and CreditScoreRevision each describe a different view from CreditRating, so both should be features in the prediction model. Moreover, CreditScoreRevision has a moderate relationship with BankcardUtilization, AvailableBankcardCredit and TradesNeverDelinquent. I have not found any feature that is unrelated to CreditRating but related to BorrowerAPR, which makes sense, since all the score data (i.e., CreditRating, ProsperScore, CreditScoreRevision) already takes the historical data into account.
I am still confused about how the CreditRating value is obtained. Is it calculated from historical data, such as payment history, credit utilization, length of credit history, new credit and credit mix? How is ProsperScore obtained? Why are there so many scores in this data set?
Lots of questions have not been answered, and more clues are needed to answer them. In other words, part of the data in the dataset is still unclear. This information cannot be obtained by exploring the data; the data collectors should be contacted for more details.
In the future, I will build a model to predict BorrowerAPR. I will split the dataset into training data and testing data and build a model using machine learning algorithms.
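The planned workflow can be sketched as follows. This is an illustrative outline on synthetic data, not a result. The ordinal encoding of CreditRating, the 80/20 split, the seeds and all helper names are assumptions, and a single-feature least-squares line stands in for whichever algorithm is eventually chosen.

```python
import random
from math import sqrt

# Assumed ordinal encoding of CreditRating, from worst (HR) to best (AA).
RANK = {"HR": 1, "E": 2, "D": 3, "C": 4, "B": 5, "A": 6, "AA": 7}

def make_toy_data(n, seed=0):
    """Synthetic (rank, APR) pairs: APR falls with rank, plus noise. Not Prosper data."""
    rng = random.Random(seed)
    ranks = list(RANK.values())
    return [(r, 0.40 - 0.045 * r + rng.gauss(0, 0.02))
            for r in (rng.choice(ranks) for _ in range(n))]

def split(data, test_frac=0.2, seed=0):
    """Shuffle a copy of the data and split it into (train, test)."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_frac)
    return rows[n_test:], rows[:n_test]

def fit_line(rows):
    """Ordinary least squares for y = a + b*x with one feature; returns (a, b)."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    b = sxy / sxx
    return my - b * mx, b

def rmse(rows, a, b):
    """Root mean squared error of the line on the given rows."""
    return sqrt(sum((y - (a + b * x)) ** 2 for x, y in rows) / len(rows))

train, test = split(make_toy_data(500))
a, b = fit_line(train)       # fit on the training part only
test_rmse = rmse(test, a, b)  # evaluate on the held-out part
```

Evaluating on the held-out rows rather than the training rows is what makes the error estimate meaningful; with the real data, more features such as ProsperScore and CreditScoreRevision would be added to the model.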